SIR Modeling Example

Jacqueline Buros

Generable

Introduction

A little about me

  1. I don’t work in infectious disease modeling, instead I analyze:

    • Clinical trial data from Oncology
    • Bioinformatics (genomics, multi-omics) in Alzheimer’s Disease, Oncology, and others
    • Previously: book sales, marketing campaigns, market research, cardiology, epidemiology, etc.
  2. I’ve always been interested in different modeling approaches. There is a ton of value in cross-disciplinary pollination.

  3. I work in multiple programming languages:

    • Currently Stan, Python and R.
    • Previously: Stata, SAS, even some old VB code.

The SIR Model

This is a 3-compartment ODE model, in which a person at any one time is either:

  • Susceptible to disease
  • Infected (assumed infectious)
  • Recovered (and thus not infectious)

Image source: https://mc-stan.org/users/documentation/case-studies/boarding_school_case_study.html

Compartments

Each of these states is termed a “compartment”, and we model the size of each compartment at time \(t\) as the number of individuals in this state at that time.

It is described by a system of ODEs:

\[ \begin{aligned} \frac{dS}{dt} &= -\beta S \frac{I}{N}\\ \frac{dI}{dt} &= \beta S \frac{I}{N} - \gamma I \\ \frac{dR}{dt} &= \gamma I \end{aligned} \]

The number of infections in any day or week will depend on the sizes of these compartments, and:

  • \(\frac{I}{N}\): the proportion of infected people in the population.
  • \(\beta\): how infectious the disease is. \(\frac{1}{\beta}\) is also the average time between contacts in the population.
  • \(\gamma\): rate of recovery of infected persons. \(\frac{1}{\gamma}\) is the average recovery time per person.

Implications

  1. The total population size is fixed: \(N = S(t) + I(t) + R(t)\) at all times
  2. The \(R\) state is terminal: once a subject is in this state, the subject’s state cannot change.
  3. The dynamics of \(I(t)\) over time depend entirely on \(\beta\) and \(\gamma\), specifically:

\[ R_0 = \frac{\beta}{\gamma} \]

This number is called the basic reproduction number, reflecting the expected number of new infections from a single infected person in a population of susceptible people.

src: https://en.wikipedia.org/wiki/Measles

Before we get started

I’m not going to go into details regarding Bayesian inference, statistics, or Stan.

If none of these are familiar to you, treat this as a possibly motivating example. There are lots of great resources on Bayesian analysis and Stan on the internet; nothing I present here will help you that much.

Also: Not all the code has been shown here in the slides. See the source qmd file for details.

Basic SIR model

Install packages

Before we get started (if you want to follow along):

  1. Install cmdstanr according to documentation
  2. Use cmdstanr to install CmdStan: install_cmdstan()

Defining the SIR Model

We are going to start by implementing the ode function body in Stan code:

functions {
  real[] sir(real t, real[] y, real[] theta, 
             real[] x_r, int[] x_i) {

      real beta = theta[1];
      real gamma = theta[2];
      
      real S = y[1];
      real I = y[2];
      real R = y[3];
      real P = S + I + R;
      vector[3] dy_dt;

      dy_dt[1] = -beta * I * S / P;
      dy_dt[2] =  beta * I * S / P - gamma * I;
      dy_dt[3] =  gamma * I;
      return(dy_dt);
  }
}

Population Dynamics

We can now write a minimal Stan program that will allow us to simulate different population dynamics.

We write all our methods as functions, so we can re-use them later.

We will save this in a file: sir-simulation.stan

functions {
  real[] sir(real t, real[] y, real[] theta, 
             real[] x_r, int[] x_i) {

      real beta = theta[1];
      real gamma = theta[2];

      real S = y[1];
      real I = y[2];
      real R = y[3];
      real N = S + I + R;
      real dy_dt[3];

      dy_dt[1] = -beta * I * S / N;
      dy_dt[2] =  beta * I * S / N - gamma * I;
      dy_dt[3] =  gamma * I;
      return(dy_dt);
  }
  
  real[,] simulate_sir(real[] t, data int N, data int i0, real beta, real gamma) {
    int N_t = num_elements(t);
    real y0[3] = {N - i0 + 0.0, i0 + 0.0, 0.0};
    real t0 = 0;
    real theta[2] = {beta, gamma};
    real y[N_t, 3];
    y = integrate_ode_rk45(sir, y0, t0, t, theta, rep_array(0.0, 0), rep_array(0, 0));
    return(y);
  }
}
model {
}

Simulating from the model

We can now use the rstan package to expose these as R functions.

This is a powerful tool for model interrogation; the functions are written in Stan code, transpiled to C++, and exposed to R through Rcpp.

library(rstan)
expose_stan_functions('model/sir-simulation.stan')
t <- seq(from = 0, to = 25, by = 0.1)
sim_df <- simulate_sir(t = t,
                   N = 1000,
                   i0 = 1,
                   beta = 4,
                   gamma = 1/4
                  ) %>%
  map(set_names, c('S', 'I', 'R')) %>% 
  map_dfr(as_tibble_row, .id = 't_index') %>%
  mutate(t_index = as.integer(t_index)) %>%
  left_join(tibble(t = t) %>% mutate(t_index = row_number()))

sim_df %>%
  tidyr::gather(state, N, S, I, R) %>%
  ggplot(aes(x = t, y = N, colour = state)) +
  geom_line(linewidth = 2) +
  ggtitle('Simulated Population Dynamics') +
  theme(text = element_text(size = 15))

Result

Data likelihood

Of course, the data that are observed are rarely this clean.

Let’s assume we have data for daily counts of cases, and that we will model this as a negative binomial observation model.

We add a new function to our data simulation for the observed data:

  int[] simulate_cases_rng(real[] t, data int N, data int i0, real beta, real gamma, real phi) {
    int N_t = num_elements(t);
    real y[N_t, 3] = simulate_sir(t, N, i0, beta, gamma);
    int cases[N_t] = neg_binomial_2_rng(y[,2], phi);
    return(cases);
  }

Update our simulation

Now, we can update our simulation to include both the assumed “true” population incidence and the observed counts:

expose_stan_functions('model/sir-simulation.stan')
case_t <- seq(from = 0, to = max(t), by = 1)
sim_cases <- simulate_cases_rng(t = case_t,
                   N = 1000,
                   i0 = 1,
                   beta = 4,
                   gamma = 1/4,
                   phi = 10,
                  ) %>%
  tibble(cases = .) %>%
  bind_cols(t = case_t)

Using rstan::expose_stan_functions to integrate this ODE system and sample daily counts:

Result

Testing the model

Fit to simulated data

Next, we evolve our Stan program to include estimating the likelihood (along with the priors) given data.

This will allow us to fit the model to our simulated data.

basic_sir <- cmdstanr::cmdstan_model(here::here('model/sir-model.stan'))
basic_sir$print()
functions {
  real[] sir(real t, real[] y, real[] theta, 
             real[] x_r, int[] x_i) {

      real beta = theta[1];
      real gamma = theta[2];

      real S = y[1];
      real I = y[2];
      real R = y[3];
      real N = S + I + R;
      real dy_dt[3];

      dy_dt[1] = -beta * I * S / N;
      dy_dt[2] =  beta * I * S / N - gamma * I;
      dy_dt[3] =  gamma * I;
      return(dy_dt);
  }
  real[,] simulate_sir(real[] t, data int N, data int i0, real beta, real gamma) {
    int N_t = num_elements(t);
    real y0[3] = {N - i0 + 0.0, i0 + 0.0, 0.0};
    real t0 = -0.1;
    real theta[2] = {beta, gamma};
    real y[N_t, 3];
    y = integrate_ode_rk45(sir, y0, t0, t, theta, rep_array(0.0, 0), rep_array(0, 0));
    return(y);
  }
  
  int[] simulate_cases_rng(real[] t, data int N, data int i0, real beta, real gamma, real phi) {
    int N_t = num_elements(t);
    real y[N_t, 3] = simulate_sir(t, N, i0, beta, gamma);
    int cases[N_t] = neg_binomial_2_rng(y[,2], phi);
    return(cases);
  }
}
data {
  int N_t;
  real t[N_t];
  int cases[N_t];
  int N;
  int i0;
}
parameters {
  real<lower = 0> beta;
  real<lower = 0> gamma;
  real<lower = 0> phi_inv;
}
transformed parameters {
  real phi = 1. / phi_inv;
  real ys[N_t, 3] = simulate_sir(t, N, i0, beta, gamma);
}
model {
  // priors
  phi_inv ~ exponential(5);
  beta ~ normal(0, 5);
  gamma ~ normal(0.4, 0.5);
  
  // likelihood
  cases ~ neg_binomial_2(ys[,2], phi);
}
generated quantities {
  int yrep[N_t] = neg_binomial_2_rng(ys[,2], phi);
  real log_lik[N_t];
  for (i in 1:N_t)
    log_lik[i] = neg_binomial_2_lpmf(cases[i] | ys[i,2], phi);
}
sim_data <- list(N = 1000, 
                 i0 = 1, 
                 N_t = nrow(sim_cases), 
                 t = sim_cases$t, 
                 cases = sim_cases$cases)
sir_fit <- basic_sir$sample(data = sim_data, seed = params$seed, refresh = 0)
sir_fit$save_object(here::here('fits', 'basic_sir_fit.Rds'))

Posterior estimates

Next, we use tidybayes to summarize inferences for key parameters. Formatting posterior estimates takes some care since we need to join indices back with the original data.

library(tidybayes)
fit_draws <- sir_fit %>%
  gather_draws(yrep[i], ys[i, state_index]) %>%
  ungroup() %>%
  left_join(sim_cases %>% mutate(i = row_number())) %>%
  left_join(tibble(state = c('S', 'I', 'R')) %>% 
              mutate(state_index = row_number())) %>%
  mutate(state = if_else(is.na(state), .variable, state)) %>%
  select(state, .value, .chain, .iteration, .draw, cases, t) %>%
  spread(state, .value)

Once formatted, we can use tidybayes to summarize flexibly:

fit_draws %>%
  gather(state, .value, S, I, R) %>%
  ggplot(aes(x = t, y = .value, group = state, fill = state, colour = state)) +
  stat_lineribbon(.width = c(0.8, 0.9), alpha = 0.4) +
  ggtitle('Posterior estimates of population dynamics') +
  labs(caption = 'Showing median and 80% CrI')

Parameters

Example: Flu outbreak

Data

Now that we have a base model, let’s test it on some real data.

There is a canonical dataset describing a particularly virulent flu outbreak at a boarding school in 1978. Later, it was determined that the flu was H1N1, however this was not known during the event.

These data are available via the outbreaks library.

We are given the following in the data description:

The index case was infected by 1978-01-10, and had febrile illness from 1978-01-15 to 1978-01-18. 512 boys out of 763 became ill.

The data include daily counts of children both in bed, and who are convalescing.

It’s worth noting that the minimal date in our data is: 1978-01-22.

These data are missing details of the index case. We will hard-code this information as our initial conditions.

We will also ignore the “convalescent” category, for now, and lump them in with the “recovered” group (neither infectious nor susceptible).

Fitting the model

# prepare data
flu_N_t <- 763
date0 <- min(influenza_england_1978_school$date)
flu_t <- as.integer(influenza_england_1978_school$date - date0)
flu_cases <- influenza_england_1978_school$in_bed

flu_data <- list(N = flu_N_t, 
                 i0 = 1, 
                 N_t = length(flu_t), 
                 t = flu_t, 
                 cases = flu_cases)
                 
# sample model
flu_fit <- basic_sir$sample(data = flu_data, 
                            seed = params$seed, refresh = 0)

Prepare posterior estimates

As before, we format the posterior estimates & summarize the expected dynamics.

library(tidybayes)
fit_draws <- flu_fit %>%
  gather_draws(yrep[i], ys[i, state_index]) %>%
  ungroup() %>%
  left_join(tibble(t = flu_t, cases = flu_cases) %>% mutate(i = row_number())) %>%
  left_join(tibble(state = c('S', 'I', 'R')) %>% 
              mutate(state_index = row_number())) %>%
  mutate(state = if_else(is.na(state), .variable, state)) %>%
  select(state, .value, .chain, .iteration, .draw, cases, t) %>%
  spread(state, .value)
fit_draws %>%
  gather(state, .value, S, I, R) %>%
  ggplot(aes(x = t, y = .value, group = state, fill = state, colour = state)) +
  stat_lineribbon(.width = c(0.8), alpha = 0.4) +
  geom_point(aes(y = cases, 
                 colour = NULL, 
                 fill = NULL)) +
  scale_y_continuous('N') +
  scale_x_continuous('time') +
  ggtitle('Posterior estimates of population dynamics for flu outbreak') +
  labs(caption = str_c('Ribbons show posterior median and 80% CrI', 
                       'Points reflect observed values',
                       sep = "\n")
       )

Predictive checks

In addition, we compare the distribution of expected counts against that of the observed data.

fit_draws %>%
  ggplot(aes(x = yrep, colour = 'predicted', fill = 'predicted')) +
  stat_halfeye(alpha = 0.3) +
  stat_histinterval(aes(x = cases, colour = 'observed', fill = 'observed'), data = tibble(cases = flu_cases),
                    alpha = 0.2) +
  ggtitle('Observed vs Predicted Case Counts for Flu Outbreak') +
  scale_fill_discrete('Type') +
  scale_colour_discrete('Type') +
  scale_y_continuous('Density') +
  scale_x_continuous('Value')

There is a higher incidence of counts in the 250-person range than expected by the model. However, it’s possible that our model is still useful, though imperfect.

Parameters for Flu outbreak

Covid-19 data

The assumptions of the basic SIR model do not apply to most pandemics.

Typically, it is a starting point. Let’s consider COVID-19 as an example.

COVID-19 data

Here are some example case counts from Switzerland

Fitting the basic model

covid_N_t<- 5e+03
df_swiss <- df_swiss %>% 
  filter(report_dt > 0)
date0 <- min(df_swiss$date)
covid_t <- as.integer(df_swiss$date - date0)
covid_cases <- df_swiss$report_dt

fit_file <- here::here('fits', 'covid-basic_sir_fit3.Rds')
if (!fs::file_exists(fit_file)) {
  covid_data <- list(N = covid_N_t, 
                   i0 = 1, 
                   N_t = length(covid_t), 
                   t = covid_t, 
                   cases = covid_cases)
  covid_fit <- basic_sir$sample(data = covid_data, seed = params$seed, refresh = 0)
  covid_fit$save_object(fit_file)
} else {
  covid_fit <- readRDS(fit_file)
}

Formatting posterior estimates takes some care

library(tidybayes)
fit_draws <- covid_fit %>%
  gather_draws(yrep[i], ys[i, state_index]) %>%
  ungroup() %>%
  left_join(tibble(t = covid_t, cases = covid_cases) %>% mutate(i = row_number())) %>%
  left_join(tibble(state = c('S', 'I', 'R')) %>% 
              mutate(state_index = row_number())) %>%
  mutate(state = if_else(is.na(state), .variable, state)) %>%
  select(state, .value, .chain, .iteration, .draw, cases, t) %>%
  spread(state, .value)

In this fit, we are providing an assumed value for N, the total population size. However, the actual number of susceptible (effectively susceptible) individuals is not known.

We also see a pretty marked discrepancy between the observed counts and the model estimates.

Both suggest we should revise the model, perhaps to estimate N.

Fitting a revised model

Now, we modify our model to estimate N rather than use an assumed value.

Here is the Stan program for the revised model.

functions {
  real[] sir2(real t, real[] y, real[] theta, 
             real[] x_r, int[] x_i) {

      real beta = theta[1];
      real gamma = theta[2];
      real initS = theta[3];
      real initI = theta[4];

      real S = y[1] + initS;
      real I = y[2] + initI;
      real R = y[3];
      real N = S + I + R;
      real dy_dt[3];

      dy_dt[1] = -beta * I * S / N;
      dy_dt[2] =  beta * I * S / N - gamma * I;
      dy_dt[3] =  gamma * I;
      return(dy_dt);
  }
  real[,] simulate_sir2(real[] t, real initS, real initI, real beta, real gamma) {
    int N_t = num_elements(t);
    real t0 = -0.1;
    real theta[4] = {beta, gamma, initS, initI};
    real y[N_t, 3];
    y = integrate_ode_rk45(sir2, rep_array(0.0, 3), t0, t, theta, rep_array(0.0, 0), rep_array(0, 0));
    return(y);
  }
  
  int[] simulate_cases2_rng(real[] t, real initS, real initI, real beta, real gamma, real phi) {
    int N_t = num_elements(t);
    real y[N_t, 3] = simulate_sir2(t, initS, initI, beta, gamma);
    int cases[N_t] = neg_binomial_2_rng(y[,2], phi);
    return(cases);
  }
}
data {
  int N_t;
  real t[N_t];
  int cases[N_t];
}
parameters {
  real<lower = 0> beta;
  real<lower = 0> gamma;
  real<lower = 0> phi_inv;
  real log_initS_raw;
  real<lower = 0> initI;
}
transformed parameters {
  real phi = 1. / phi_inv;
  real<lower = 0> initS = 1000 + exp(log_initS_raw);
  real ys[N_t, 3] = simulate_sir2(t, initS, initI, beta, gamma);
}
model {
  // priors
  phi_inv ~ exponential(5);
  beta ~ normal(0, 5);
  gamma ~ normal(0.4, 0.5);
  log_initS_raw ~ normal(5, 10);
  initI ~ std_normal();
  
  // likelihood
  cases ~ neg_binomial_2(ys[,2], phi);
}
generated quantities {
  int yrep[N_t] = neg_binomial_2_rng(ys[,2], phi);
  real log_lik[N_t];
  for (i in 1:N_t)
    log_lik[i] = neg_binomial_2_lpmf(cases[i] | ys[i,2], phi);
}

Next we formatting posterior estimates

library(tidybayes)
fit_draws2 <- covid_fit2 %>%
  gather_draws(yrep[i], ys[i, state_index]) %>%
  left_join(covid_fit2 %>% spread_draws(initS, initI)) %>%
  ungroup() %>%
  left_join(tibble(t = covid_t, cases = covid_cases) %>% mutate(i = row_number())) %>%
  left_join(tibble(state = c('S', 'I', 'R')) %>% 
              mutate(state_index = row_number())) %>%
  mutate(state = if_else(is.na(state), .variable, state)) %>%
  select(state, .value, .chain, .iteration, .draw, cases, t, initS, initI) %>%
  spread(state, .value) %>%
  mutate(S = S + initS,
         I = I + initI)

Parameters from revised model

Model extensions

In practice, the SIR model is a starting point.

There are any number of extensions that have been used, not only for COVID-19 but for other outbreaks.

8-compartment model

As an example, consider this 8-compartment model used to describe COVID-19 disease dynamics.

The observed data include daily counts of Case, Hospitalization, and Death counts. Each is modeled using a negative binomial given the estimated true prevalence.

src: Keller, JP, Zhou, T, Kaplan, A, Anderson, GB, Zhou, W. Tracking the transmission dynamics of COVID-19 with a time-varying coefficient state-space model. Statistics in Medicine. 2022; 41( 15): 2745- 2767. doi:10.1002/sim.9382

Covariates

In addition, their model includes particular covariate effects on time-varying infectiousness parameters.

For example, one informative covariate was the degree of mobility (from aggregate cell-phone tracking data)

Keller, JP, Zhou, T, Kaplan, A, Anderson, GB, Zhou, W. Tracking the transmission dynamics of COVID-19 with a time-varying coefficient state-space model. Statistics in Medicine. 2022; 41( 15): 2745- 2767. doi:10.1002/sim.9382

Addendum

Sources

Most of the content for this presentation is based on the following case study:

  • https://mc-stan.org/users/documentation/case-studies/boarding_school_case_study.html#stan_codes

In addition, I report and am motivated by the work in this paper:

  • Keller, JP, Zhou, T, Kaplan, A, Anderson, GB, Zhou, W. Tracking the transmission dynamics of COVID-19 with a time-varying coefficient state-space model. Statistics in Medicine. 2022; 41( 15): 2745- 2767. doi:10.1002/sim.9382

Full code available at:

Learn Stan